1. Research Question

Provide a research question related to U.S. domestic air travel. Break down the broad question into three smaller research questions.

Which airlines flying out of ATL experienced the greatest changes in cancellation rates during March 2020, and when did these shifts occur?

2. Rationale for the Research Question

Why was this particular research question chosen? What led to the decision to explore this topic?

Our group chose to investigate cancellation rates because airline disruptions increased sharply during the COVID-19 pandemic, especially at major domestic hubs such as ATL (Hartsfield-Jackson Atlanta International Airport). Knowing which airlines had the largest cancellation spikes, and when, offers insight into how different carriers responded to operational challenges. By examining daily trends in March 2020, we aim to identify both the overall pattern and the specific moments of disruption, so that we can evaluate which carriers were most affected and when those changes occurred.

3. Data Selection

Detail the dataset(s) your group utilized to address the research question. What were the reasons for selecting this particular dataset? Which specific variables will the group be focusing on?

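carrier – the airline operating each flight

year, month, day – to analyze trends over time (the data cover March 2020 only, so day carries the variation)

dest – the destination airport, used to map where cancelled flights were headed

cancelled_a / cancelled_flag (created) – 1 if dep_time is missing and 0 otherwise; summed to give cancelled flights by carrier and over time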

Below, please provide the code your group employed to construct and refine the dataset to suit your research needs. Make sure you export the finalized dataset into a .csv format for uploading to Canvas.

plot 1 data manipulation

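library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(plotly)      # ggplotly(), plot_ly()
library(anyflights)  # assumed source of get_airports() used below

atl_march <- read_csv("atl_march_2020.csv")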
## Rows: 33888 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): carrier, tailnum, origin, dest
## dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, ...
## dttm  (1): time_hour
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(atl_march)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
## 1  2020     3     1        4           2335        29      119            100
## 2  2020     3     1      455            503        -8      601            642
## 3  2020     3     1      504            507        -3      617            625
## 4  2020     3     1      542            545        -3      722            727
## 5  2020     3     1      556            600        -4      741            753
## 6  2020     3     1      558            600        -2      745            748
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <dbl>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Compute cancellation rates using dep_time as cancellation proxy
cancelledflights <- atl_march %>%
  filter(!is.na(carrier)) %>%
  mutate(cancelled_a = ifelse(is.na(dep_time), 1, 0)) %>%
  group_by(carrier) %>%
  summarize(
    total_flights = n(),
    cancellations = sum(cancelled_a),
    cancel_rate = round(100 * cancellations / total_flights, 2)
  ) %>%
  arrange(desc(cancel_rate))

cancelledflights
## # A tibble: 14 × 4
##    carrier total_flights cancellations cancel_rate
##    <chr>           <int>         <dbl>       <dbl>
##  1 UA                341            89       26.1 
##  2 DL              20676          5021       24.3 
##  3 AA                962           171       17.8 
##  4 F9                386            65       16.8 
##  5 OH                143            24       16.8 
##  6 B6                297            47       15.8 
##  7 YX                583            92       15.8 
##  8 WN               3315           456       13.8 
##  9 9E               3991           520       13.0 
## 10 YV                168            21       12.5 
## 11 AS                 44             5       11.4 
## 12 EV                 97            11       11.3 
## 13 OO               2066           156        7.55
## 14 NK                819            29        3.54
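
The rates above rest on treating a missing dep_time as a cancellation. One way to sanity-check that proxy is to cross-tabulate missing departure times against missing arrival times. The sketch below is only an illustration: check_cancel_proxy() is a hypothetical helper that is not part of the original analysis, and the toy rows are made up so the block runs on its own.

```r
library(dplyr)

# Hypothetical helper: counts flights by whether dep_time and arr_time are missing
check_cancel_proxy <- function(flights) {
  flights %>%
    mutate(
      no_departure = is.na(dep_time),
      no_arrival = is.na(arr_time)
    ) %>%
    count(no_departure, no_arrival)
}

# Toy rows for illustration only; in the report the input would be atl_march
toy_flights <- data.frame(
  dep_time = c(455, NA, 504, NA),
  arr_time = c(601, NA, NA, NA)
)

proxy_check <- check_cancel_proxy(toy_flights)
proxy_check
```

Rows with a departure time but no arrival time are not counted as cancelled by the proxy, so a large number of them would be worth a closer look before interpreting the rates.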

plot 2 data manipulation

# Load airport coordinates
airports <- get_airports()

# Summarize cancelled flights by destination
mapcancelled <- atl_march %>%
  filter(is.na(dep_time)) %>%
  group_by(dest) %>%
  summarise(cancelled = n(), .groups = "drop") %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  filter(!is.na(lat), !is.na(lon)) # remove airports without coordinates

plot 3 data manipulation

# Daily cancellation counts
cancelledDay <- atl_march %>%
  mutate(cancelled_flag = ifelse(is.na(dep_time), 1, 0)) %>%
  group_by(day) %>%
  summarise(cancelled_flights = sum(cancelled_flag), .groups = "drop")
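
The assignment also asks for the finalized dataset to be exported as a .csv file. A minimal sketch of that step follows, assuming readr's write_csv(). The file name is our own choice (hypothetical), and a small toy table with made-up values stands in for cancelledflights so the block runs on its own.

```r
library(readr)

# Toy stand-in for the cancelledflights summary (illustrative values only)
toy_summary <- data.frame(
  carrier = c("AA", "BB"),
  total_flights = c(200, 50),
  cancellations = c(50, 5),
  cancel_rate = c(25, 10)
)

# Hypothetical output location; a temporary folder keeps the sketch self-contained
out_path <- file.path(tempdir(), "atl_march_cancellations.csv")
write_csv(toy_summary, out_path)

# Read the file back to confirm the export round-trips
exported <- read_csv(out_path, show_col_types = FALSE)
```

In the report itself the call would be along the lines of write_csv(cancelledflights, "atl_march_cancellations.csv"), repeated for any other table that needs to be uploaded.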

4. Analytical Approach and Visual Findings

Apply skills you learned from the classes to import, tidy, transform, wrangle, and visualize the data.

Your group should present a minimum of three distinct analyses or visualizations to adequately address the research question. Detail the codes used, and provide a clear interpretation of each visualization.

plot 1

# Bar chart: Cancellation Rate by Carrier
p1 <- ggplot(cancelledflights, aes(x = carrier, y = cancel_rate, fill = carrier)) +
  geom_col() +
  geom_text(aes(label = cancel_rate), vjust = -0.5, size = 3) +
  labs(title = "Cancellation Rate by Airline", x = "Airline", y = "Cancellation Rate (%)") +
  theme_minimal()

ggplotly(p1)
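
Plot 1 places the bars in alphabetical order of carrier code. A possible variant, sketched here with made-up values and hypothetical object names (toy_rates, ordered_rates, p1_sorted), sorts the bars from highest to lowest rate so the ranking is easier to read.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for cancelledflights (illustrative values only)
toy_rates <- data.frame(
  carrier = c("AA", "BB", "CC"),
  cancel_rate = c(5, 30, 20)
)

# reorder() turns carrier into a factor whose levels follow descending rate
ordered_rates <- toy_rates %>%
  mutate(carrier = reorder(carrier, -cancel_rate))

p1_sorted <- ggplot(ordered_rates, aes(x = carrier, y = cancel_rate, fill = carrier)) +
  geom_col() +
  labs(title = "Cancellation Rate by Airline", x = "Airline", y = "Cancellation Rate (%)") +
  theme_minimal()
```

The same mutate() step could be applied to cancelledflights before it is passed to ggplot().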

plot 2

# Load US states map
states <- map_data("state")
p2 <- ggplot() +
  geom_polygon(data = states, aes(x = long, y = lat, group = group), fill = "white", color = "gray") + 
  geom_point(data = mapcancelled, aes(x = lon, y = lat, size = cancelled), color = "red", alpha = 0.6) +
  scale_size_continuous(range = c(2, 8)) +
  labs(title = "Cancelled ATL Flights by Destination") +
  coord_fixed(1.3) +
  theme_void()

ggplotly(p2)

plot 3

# Line graph version of daily cancellations
plot_ly(data = cancelledDay, x = ~day, y = ~cancelled_flights, type = "scatter", mode = "lines+markers", line = list(color = "darkblue")) %>%
  layout(title = "Daily Cancelled Flights From ATL", xaxis = list(title = "Day"), yaxis = list(title = "Number of Cancelled Flights"))
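
Plot 3 counts cancellations across all carriers combined. Because the research question asks which airlines changed and when, a per-carrier daily rate may be a useful complement. The sketch below is an assumption about how that could look, not part of the original analysis: daily_rate_by_carrier() and p_daily are hypothetical names, and the toy rows are made up so the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: cancellation rate (%) per carrier per day,
# using a missing dep_time as the cancellation proxy
daily_rate_by_carrier <- function(flights) {
  flights %>%
    filter(!is.na(carrier)) %>%
    group_by(carrier, day) %>%
    summarise(
      total_flights = n(),
      cancellations = sum(is.na(dep_time)),
      cancel_rate = round(100 * cancellations / total_flights, 2),
      .groups = "drop"
    )
}

# Toy rows for illustration only; in the report the input would be atl_march
toy_flights <- data.frame(
  carrier = c("DL", "DL", "DL", "DL", "UA", "UA"),
  day = c(1, 1, 20, 20, 1, 20),
  dep_time = c(455, 503, NA, 1210, 600, NA)
)

daily <- daily_rate_by_carrier(toy_flights)

# One line per carrier shows when each airline's rate started to climb
p_daily <- ggplot(daily, aes(x = day, y = cancel_rate, color = carrier)) +
  geom_line() +
  geom_point() +
  labs(title = "Daily Cancellation Rate by Airline", x = "Day", y = "Cancellation Rate (%)") +
  theme_minimal()
```

Carriers with only a handful of flights per day will produce noisy rates, so it may be worth restricting the plot to the larger carriers or aggregating by week.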

5. Implications and Conclusions

Did the findings align with your initial expectations? What key insights have been gleaned from this research? What broader implications do these findings have for the airline industry, travelers, or policymakers?

Yes, our findings largely aligned with our initial expectations. We expected a sharp rise in cancellations once widespread COVID-19 shutdowns began around March 15, 2020. Our line graph (“Daily Cancelled Flights From ATL”) showed cancellations jumping from close to zero per day to more than 600 per day, and they were still rising at the end of the month.

Our map showed where the cancelled flights had been headed. Most were headed to major cities, which made sense because COVID-19 was severe in major cities and because more flights go to major cities in the first place. Destinations in the East appeared worse off for cancelled flights, which was a bit of a surprise because we believed the West Coast was hit harder at first. This may be because there are more major airports on the East Coast.

Our bar graph was surprising because we did not expect United Airlines (UA) to have the highest share of its flights cancelled (26.1%). This may be because United had more flights scheduled for later in the month, after COVID-19 hit the United States, or because it scaled back operations earlier than other carriers. This would be very interesting to research further. Overall, the largest national airlines were cut back the most: United (UA), Delta (DL), and American (AA) had the three highest cancellation rates, at 26.1%, 24.3%, and 17.8%.